During the summer of 2012, wildfires ravaged the northern part of the Algerian territory, especially the coastal cities. This disaster was driven by higher-than-average temperatures, which reached as high as 50 degrees Celsius.
One important measure against the recurrence of such disasters is the ability to predict their occurrence. In this project, we will attempt to predict these forest fires based on multiple features related to weather indices.
The dataset we will use to train and test our models consists of 244 observations on two Algerian Wilayas (provinces): Sidi-Bel Abbes and Bejaia. The observations were gathered over four months, from June to September 2012, for both regions.
The Dataset contains the following variables:
We first start off by importing the necessary libraries for our analysis.
The libraries we used are the following:
The Dataset provided to us was in the form of a .csv file that contained two tables, one table for the observations belonging to the Sidi-Bel Abbes region, and the other for Bejaia.
Before starting our analysis we separated the tables into two distinct files according to the region. We named both files Algerian_forest_fires_dataset_Bejaia.csv and Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv for Bejaia and Sidi-Bel Abbes respectively.
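A minimal sketch of this loading step, assuming the two files sit in the working directory:

```r
# Read the two region files produced by splitting the original .csv
df_b <- read.csv("Algerian_forest_fires_dataset_Bejaia.csv")
df_s <- read.csv("Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv")
```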
We first check for null values in the dataset; none were found.
colSums(is.na(df_s))
day month year Temperature RH Ws Rain
0 0 0 0 0 0 0
FFMC DMC DC ISI BUI FWI Classes
0 0 0 0 0 0 0
We then proceed to add a column in both datasets to indicate the region (Wilaya) of each table. We chose the following encoding:
After that, we merge both datasets into a single dataframe using full_join(), which will allow us to easily explore and analyze the data.
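The tagging and merge can be sketched as follows; the 1/0 assignment shown in the comments is an assumption inferred from the str() output and should be checked against the encoding actually chosen:

```r
library(dplyr)

# Tag each table with its region before merging
# (1 = Bejaia, 0 = Sidi-Bel Abbes here; this mapping is an assumption)
df_b$Region <- 1
df_s$Region <- 0

# full_join() keeps every row from both tables
df <- full_join(df_b, df_s)
```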
str(df)
'data.frame': 244 obs. of 15 variables:
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 6 6 6 6 6 6 6 6 6 6 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Temperature: int 32 30 29 30 32 35 35 28 27 30 ...
$ RH : int 71 73 80 64 60 54 44 51 59 41 ...
$ Ws : int 12 13 14 14 14 11 17 17 18 15 ...
$ Rain : num 0.7 4 2 0 0.2 0.1 0.2 1.3 0.1 0 ...
$ FFMC : num 57.1 55.7 48.7 79.4 77.1 83.7 85.6 71.4 78.1 89.4 ...
$ DMC : num 2.5 2.7 2.2 5.2 6 8.4 9.9 7.7 8.5 13.3 ...
$ DC : num 8.2 7.8 7.6 15.4 17.6 26.3 28.9 7.4 14.7 22.5 ...
$ ISI : num 0.6 0.6 0.3 2.2 1.8 3.1 5.4 1.5 2.4 8.4 ...
$ BUI : num 2.8 2.9 2.6 5.6 6.5 9.3 10.7 7.3 8.3 13.1 ...
$ FWI : num 0.2 0.2 0.1 1 0.9 3.1 6 0.8 1.9 10 ...
$ Classes : chr "not fire " "not fire " "not fire " "not fire " ...
$ Region : num 1 1 1 1 1 1 1 1 1 1 ...
unique(df$month)
[1] 6 7 8 9
We check again for any NA values that might have been introduced by merging the data from both tables. One row contained NA values in DC and FWI; we delete it, since losing a single row will not affect our overall dataset.
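One way to drop the offending row, as a sketch:

```r
# Remove the single row carrying NA values in DC and FWI
df <- na.omit(df)
```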
dim(df)
[1] 243 15
We now proceed to display the different range of values some categorical variables might contain, mainly the Classes and the Region columns.
unique(df$Region)
[1] 1 0
We find that the Classes column has values containing unneeded space characters; we proceed to trim them.
df$Classes <- trimws(df$Classes, which = c("both"))
unique(df$Classes)
[1] "not fire" "fire"
We then turn the fire/not fire values into 1/0 respectively for future analysis.
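A minimal sketch of this recoding, applied to the trimmed labels:

```r
# Recode the class labels: "fire" -> 1, "not fire" -> 0
df$Classes <- ifelse(df$Classes == "fire", 1, 0)
```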
str(df)
'data.frame': 243 obs. of 15 variables:
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 6 6 6 6 6 6 6 6 6 6 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Temperature: int 32 30 29 30 32 35 35 28 27 30 ...
$ RH : int 71 73 80 64 60 54 44 51 59 41 ...
$ Ws : int 12 13 14 14 14 11 17 17 18 15 ...
$ Rain : num 0.7 4 2 0 0.2 0.1 0.2 1.3 0.1 0 ...
$ FFMC : num 57.1 55.7 48.7 79.4 77.1 83.7 85.6 71.4 78.1 89.4 ...
$ DMC : num 2.5 2.7 2.2 5.2 6 8.4 9.9 7.7 8.5 13.3 ...
$ DC : num 8.2 7.8 7.6 15.4 17.6 26.3 28.9 7.4 14.7 22.5 ...
$ ISI : num 0.6 0.6 0.3 2.2 1.8 3.1 5.4 1.5 2.4 8.4 ...
$ BUI : num 2.8 2.9 2.6 5.6 6.5 9.3 10.7 7.3 8.3 13.1 ...
$ FWI : num 0.2 0.2 0.1 1 0.9 3.1 6 0.8 1.9 10 ...
$ Classes : num 0 0 0 0 0 1 1 0 0 1 ...
$ Region : num 1 1 1 1 1 1 1 1 1 1 ...
We delete the year column, since all observations were recorded in the same year.
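The year removal and scaling can be sketched as follows; which columns are standardized is inferred from the str() output of df_scaled:

```r
# Drop the constant year column
df$year <- NULL

# Standardize the continuous predictors; day, month, Classes and Region
# are left untouched
num_cols <- c("Temperature", "RH", "Ws", "Rain",
              "FFMC", "DMC", "DC", "ISI", "BUI", "FWI")
df_scaled <- df
df_scaled[num_cols] <- scale(df_scaled[num_cols])
```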
str(df_scaled)
'data.frame': 243 obs. of 14 variables:
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 6 6 6 6 6 6 6 6 6 6 ...
$ Temperature: num -0.042 -0.593 -0.869 -0.593 -0.042 ...
$ RH : num 0.604 0.739 1.211 0.132 -0.138 ...
$ Ws : num -1.243 -0.887 -0.531 -0.531 -0.531 ...
$ Rain : num -0.0314 1.6159 0.6175 -0.3809 -0.281 ...
$ FFMC : num -1.4455 -1.5431 -2.0309 0.1085 -0.0517 ...
$ DMC : num -0.983 -0.967 -1.007 -0.765 -0.7 ...
$ DC : num -0.865 -0.873 -0.878 -0.714 -0.668 ...
$ ISI : num -0.997 -0.997 -1.069 -0.612 -0.708 ...
$ BUI : num -0.976 -0.969 -0.99 -0.779 -0.716 ...
$ FWI : num -0.919 -0.919 -0.932 -0.811 -0.825 ...
$ Classes : num 0 0 0 0 0 1 1 0 0 1 ...
$ Region : num 1 1 1 1 1 1 1 1 1 1 ...
We have ended up with a clean and scaled dataframe named df_scaled, which we will use to visualize and further explore our data.
Our first instinct is to compare the two regions together in terms of number of fires, and average temperature.
We used the unscaled dataset to plot the real life values of the temperatures.
df %>%
group_by(Region) %>%
summarise(Number_of_fires = sum(Classes), Temperature = mean(Temperature)) %>%
ggplot(aes(x=Region, y=Number_of_fires, fill = Temperature))+
geom_col(position='dodge')
We can see that the Sidi-Bel Abbes region had a greater total number of fires and a higher average temperature throughout the summer of 2012.
The previous results push us to suspect a positive relationship between the temperature and the likelihood of having a fire. However, we need to investigate all the other variables, which is why we will plot a correlation matrix of the features in the dataset.
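A sketch of the correlation matrix plot, here using the corrplot package (one option among several; an assumption on our part):

```r
library(corrplot)

# Correlation matrix of all numeric columns in the scaled data frame
corr_mat <- cor(df_scaled[sapply(df_scaled, is.numeric)])
corrplot(corr_mat, method = "color", type = "upper")
```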
We performed feature selection using the Caret package to determine which features are the most important and which are the least.
In this case, we opted for Linear Discriminant Analysis with Stepwise Feature Selection by specifying stepLDA as our method.
The varImp function returns a measure of importance out of 100 for each of the features. According to the official Caret documentation, the importance metric is calculated by conducting a ROC curve analysis on each predictor; a series of cutoffs is applied to the predictor data to predict the class. The AUC is then computed and is used as a measure of variable importance.
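The selection step can be sketched as follows; the exact trainControl settings are an assumption:

```r
library(caret)

set.seed(1000)
# Stepwise LDA feature selection via caret (method "stepLDA" requires
# the klaR package)
fs_model <- train(as.factor(Classes) ~ ., data = df_scaled,
                  method = "stepLDA",
                  trControl = trainControl(method = "cv", number = 10))

# ROC-based variable importance, scaled to 0-100
varImp(fs_model)
```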
We can see that the variables month, Ws, Region, and day are insignificant compared to the other features; using a threshold of 0.7 on the importance measure, we will disregard them in our model.
For the following models, we will only use the features that were the most significant in our feature selection phase. The selected features are:
We begin by splitting the data into train/test sets with an 80/20 split, chosen as a standard default. This leaves us with 191 observations in the training set and 52 in the test set. Given the small size of the dataset, we will later apply cross-validation to some models in order to further examine their performance and compare them with each other.
We set a seed of 1000.
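A sketch of the split; the exact partitioning call is an assumption (the reported counts are 191 training and 52 test observations):

```r
library(caret)

set.seed(1000)
# Stratified 80/20 split on the class label
idx <- createDataPartition(df_scaled$Classes, p = 0.8, list = FALSE)
train_set <- df_scaled[idx, ]
test_set  <- df_scaled[-idx, ]
```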
dim(test_set)
[1] 52 14
Logistic Regression is considered to be an extension of Linear Regression, in which we predict a qualitative response for an observation. It gives us the probability of a certain observation belonging to a class in binary classification, but can also be extended to multiclass problems.
We first fit our model on the training set. Doing so produces a warning that the model did not converge; this is because the model is able to perfectly separate the training data into positive/negative observations. This might sound counterintuitive, but here the warning is a sign of a very strong fit rather than a problem.
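The fitting call, matching the model summary below:

```r
# Logistic regression on the selected features
logistic_model <- glm(Classes ~ Temperature + Rain + FFMC + DMC + DC +
                        ISI + BUI + FWI + RH,
                      family = "binomial", data = train_set)
```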
logistic_model
Call: glm(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC +
ISI + BUI + FWI + RH, family = "binomial", data = train_set)
Coefficients:
(Intercept) Temperature Rain FFMC DMC DC ISI BUI FWI RH
195.74 -42.31 50.32 114.50 -57.81 65.45 301.95 -38.43 155.10 -17.28
Degrees of Freedom: 190 Total (i.e. Null); 181 Residual
Null Deviance: 261.5
Residual Deviance: 1.428e-07 AIC: 20
Since logistic regression gives us the probability of each observation belonging to the 1 class, we will use a 0.5 threshold to transform that probability into a classification of either 0 or 1.
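A sketch of the thresholding step:

```r
# Predicted probabilities of class 1, then a 0.5 cutoff
probs <- predict(logistic_model, newdata = test_set, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
```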
After getting our predictions, we will use the confusionMatrix function from the caret library, which computes a set of performance metrics including the f1-score, recall, and precision. Other metrics computed include sensitivity, specificity, prevalence, etc. The official documentation for this function, with the formulas for all metrics, is found in this link. We will only be interested in the f1-score, recall, precision, accuracy, and balanced accuracy.
Our model gives us an accuracy and an f1 score of 100% on the training set.
train_set$Classes
[1] 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 0 0 1 0 0 1 1
[83] 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[165] 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0
On the test set, however, we get an accuracy of 98.08% and an f1 score of 98.25%.
test_set$Classes
[1] 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 0
[41] 0 1 1 1 1 0 0 0 0 0 0 1
Levels: 0 1
As we plot the ROC curve, we can see that the AUC is equal to 98.28%, which is close to a perfect classifier.
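The ROC/AUC computation can be sketched with the pROC package (our choice here; an assumption):

```r
library(pROC)

# ROC curve and AUC on the held-out test set,
# using the predicted probabilities
roc_obj <- roc(test_set$Classes, probs)
plot(roc_obj)
auc(roc_obj)
```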
Linear Discriminant Analysis is best used when the decision boundary of our given dataset is assumed to be linear. There are two basic assumptions that LDA takes into consideration:
Since LDA assumes that each input variable has the same variance, we will use the standardized data-frame in the train test splits. Each variable in the standardized data-frame has mean of 0 and variance of 1.
lda_model
Call:
lda(Classes ~ Temperature + Rain + FFMC + DMC + DC + ISI + BUI +
FWI + RH, data = train_set, family = "binomial")
Prior probabilities of groups:
0 1
0.434555 0.565445
Group means:
Temperature Rain FFMC DMC DC ISI BUI FWI RH
0 -0.6330807 0.3721387 -0.8343419 -0.6612676 -0.5799594 -0.8280654 -0.6669278 -0.8161663 0.4701120
1 0.4633549 -0.3244797 0.6739891 0.4867494 0.4267318 0.6591273 0.4850895 0.6262935 -0.3699445
Coefficients of linear discriminants:
LD1
Temperature 0.08978603
Rain 0.18218660
FFMC 1.35842928
DMC -1.15810129
DC -0.32744874
ISI 0.37584652
BUI 1.09904453
FWI 0.92679809
RH 0.57487142
On our training data, the model reached an accuracy of 95.81% and an f1 score of 96.30%, with 4 false positives and 1 false negative.
confusionMatrix(preds_lda$class, train_set$Classes,
mode = "everything",
positive="1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 79 4
1 4 104
Accuracy : 0.9581
95% CI : (0.9191, 0.9817)
No Information Rate : 0.5654
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9148
Mcnemar's Test P-Value : 1
Sensitivity : 0.9630
Specificity : 0.9518
Pos Pred Value : 0.9630
Neg Pred Value : 0.9518
Precision : 0.9630
Recall : 0.9630
F1 : 0.9630
Prevalence : 0.5654
Detection Rate : 0.5445
Detection Prevalence : 0.5654
Balanced Accuracy : 0.9574
'Positive' Class : 1
As we can see below, the number of false positives is 1, and the number of false negatives is 1 as well. Our model also yielded an accuracy of 96.15% and an f1 score of 96.55%.
confusionMatrix(preds_lda$class, test_set$Classes,
mode = "everything",
positive="1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 22 1
1 1 28
Accuracy : 0.9615
95% CI : (0.8679, 0.9953)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 5.691e-11
Kappa : 0.922
Mcnemar's Test P-Value : 1
Sensitivity : 0.9655
Specificity : 0.9565
Pos Pred Value : 0.9655
Neg Pred Value : 0.9565
Precision : 0.9655
Recall : 0.9655
F1 : 0.9655
Prevalence : 0.5577
Detection Rate : 0.5385
Detection Prevalence : 0.5577
Balanced Accuracy : 0.9610
'Positive' Class : 1
The AUC for LDA was 96.10%, similar to the one for Logistic Regression.
Quadratic Discriminant Analysis is best used when the decision boundary of our given dataset is assumed to be non-linear. Similarly to LDA, QDA makes two basic assumptions:
qda_model
Call:
qda(Classes ~ Temperature + Rain + FFMC + DMC + DC + ISI + BUI +
FWI + RH, data = train_set)
Prior probabilities of groups:
0 1
0.434555 0.565445
Group means:
Temperature Rain FFMC DMC DC ISI BUI FWI RH
0 -0.6330807 0.3721387 -0.8343419 -0.6612676 -0.5799594 -0.8280654 -0.6669278 -0.8161663 0.4701120
1 0.4633549 -0.3244797 0.6739891 0.4867494 0.4267318 0.6591273 0.4850895 0.6262935 -0.3699445
Unlike LDA, QDA does not produce linear discriminant coefficients, since each class is fitted with its own covariance matrix; the group means above play the same interpretive role as in LDA.
Our model yields an accuracy of 98.43% and an f1 score of 98.62% on the training set.
confusionMatrix(preds_qda$class, train_set$Classes,
mode = "everything",
positive="1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 81 1
1 2 107
Accuracy : 0.9843
95% CI : (0.9548, 0.9967)
No Information Rate : 0.5654
P-Value [Acc > NIR] : <2e-16
Kappa : 0.968
Mcnemar's Test P-Value : 1
Sensitivity : 0.9907
Specificity : 0.9759
Pos Pred Value : 0.9817
Neg Pred Value : 0.9878
Precision : 0.9817
Recall : 0.9907
F1 : 0.9862
Prevalence : 0.5654
Detection Rate : 0.5602
Detection Prevalence : 0.5707
Balanced Accuracy : 0.9833
'Positive' Class : 1
As we can see below, the number of false positives is 3 and the number of false negatives is 1. The results are good; for fire prediction, false negatives are the costlier error, since we do not want to miss any actual fires, so erring toward false positives is acceptable. Our model yielded an f1-score of 93.33% and an accuracy of 92.31%.
confusionMatrix(preds_qda$class, test_set$Classes,
mode = "everything",
positive="1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 20 1
1 3 28
Accuracy : 0.9231
95% CI : (0.8146, 0.9786)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 7.729e-09
Kappa : 0.8427
Mcnemar's Test P-Value : 0.6171
Sensitivity : 0.9655
Specificity : 0.8696
Pos Pred Value : 0.9032
Neg Pred Value : 0.9524
Precision : 0.9032
Recall : 0.9655
F1 : 0.9333
Prevalence : 0.5577
Detection Rate : 0.5385
Detection Prevalence : 0.5962
Balanced Accuracy : 0.9175
'Positive' Class : 1
After plotting the ROC curve we got an AUC of 91.75%, which is worse than both logistic regression and LDA.
We can observe that QDA performs better than LDA on the training data because it has a tendency to overfit it. However, LDA performs better on the test data, since it generalizes better to unseen data points.
In this section, we will explore KNN's performance on our problem. We will use hyperparameter tuning to determine the best number of nearest neighbors (k), and we will also use repeated cross-validation during training for a better performance estimate.
Since KNN is a distance based model, we will here again use our normalized dataset instead of the original.
The summaryFunction argument determines which metric to use to determine the performance of a particular hyperparameter setting. Here we shall use defaultSummary which calculates accuracy and kappa statistic.
We have opted for 10-fold cross-validation repeated 10 times. The classProbs parameter is set to TRUE, so we can adjust the probability threshold later when we test our model's performance.
training_control <- trainControl(method = "repeatedcv",
summaryFunction = defaultSummary,
classProbs = TRUE,
number = 10,
repeats = 10)
Now we use the train() function to perform the model training and tuning of the k hyperparameter. The range of k runs from 3 to 85 in steps of 2, so we only consider odd values of k; this avoids ties, which is best practice for binary KNN classification.
Another tweak we need to make to our dataset is to change the target variable values to valid R variable names, since computing class probabilities turns each class label into a column name with its own probability values. Leaving the values as {0, 1} would throw an error, so we set our Classes values back to 'fire' and 'not_fire' and proceed.
train_set_names$Classes
[1] not_fire not_fire not_fire not_fire fire fire not_fire not_fire fire fire not_fire not_fire not_fire not_fire not_fire not_fire fire not_fire
[19] not_fire fire fire fire fire not_fire fire fire fire fire fire fire fire fire not_fire fire fire fire
[37] fire fire fire fire fire fire not_fire not_fire fire fire fire not_fire fire fire fire fire not_fire not_fire
[55] fire fire fire fire fire fire fire fire fire fire fire fire fire fire not_fire fire fire fire
[73] not_fire not_fire fire not_fire not_fire fire not_fire not_fire fire fire fire fire fire fire fire not_fire fire not_fire
[91] not_fire fire not_fire not_fire not_fire not_fire not_fire not_fire not_fire fire fire not_fire not_fire fire fire not_fire not_fire not_fire
[109] not_fire not_fire not_fire not_fire fire fire fire fire fire fire not_fire fire not_fire not_fire fire not_fire fire not_fire
[127] not_fire not_fire not_fire not_fire not_fire not_fire fire fire fire not_fire not_fire not_fire fire fire fire fire fire not_fire
[145] not_fire fire fire fire fire not_fire fire fire fire fire fire fire fire fire fire fire fire fire
[163] fire fire fire fire not_fire not_fire not_fire not_fire not_fire fire not_fire not_fire not_fire not_fire not_fire not_fire not_fire not_fire
[181] fire fire fire fire fire not_fire not_fire not_fire not_fire not_fire not_fire
Levels: not_fire fire
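The tuning call described above can be sketched as follows, using the renamed data frame train_set_names and the training_control object defined earlier (and assuming train_set_names holds only the selected features):

```r
set.seed(1000)
# Tune k over odd values 3..85 with repeated 10-fold cross-validation
knn_cv <- train(Classes ~ ., data = train_set_names,
                method = "knn",
                trControl = training_control,
                tuneGrid = expand.grid(k = seq(3, 85, by = 2)))
```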
knn_cv
k-Nearest Neighbors
191 samples
9 predictor
2 classes: 'not_fire', 'fire'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 172, 172, 172, 172, 172, 172, ...
Resampling results across tuning parameters:
k Accuracy Kappa
3 0.9271520 0.8527382
5 0.9274912 0.8522411
7 0.9432690 0.8849233
9 0.9358684 0.8694234
11 0.9326520 0.8624502
13 0.9331550 0.8633457
15 0.9289971 0.8548642
17 0.9258918 0.8484631
19 0.9243392 0.8453284
21 0.9259152 0.8488081
23 0.9223070 0.8411710
25 0.9222310 0.8407475
27 0.9253333 0.8470459
29 0.9290468 0.8548415
31 0.9269415 0.8502645
33 0.9285263 0.8534166
35 0.9316316 0.8598659
37 0.9332105 0.8632230
39 0.9347632 0.8663944
41 0.9342368 0.8652574
43 0.9368684 0.8708486
45 0.9379211 0.8729864
47 0.9342895 0.8655254
49 0.9358158 0.8686271
51 0.9368129 0.8708728
53 0.9379240 0.8732183
55 0.9374240 0.8721588
57 0.9373977 0.8720978
59 0.9410292 0.8795798
61 0.9432164 0.8842024
63 0.9447427 0.8873647
65 0.9462690 0.8905616
67 0.9457427 0.8894259
69 0.9447690 0.8874640
71 0.9395292 0.8768961
73 0.9369240 0.8717375
75 0.9353158 0.8684827
77 0.9321813 0.8623134
79 0.9296023 0.8570235
81 0.9254415 0.8486104
83 0.9232573 0.8443844
85 0.9185965 0.8350012
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 65.
Inspecting the probabilities reveals that a cutoff probability around 0.5 gives the best classification results. In the predict function, the cutoff is 0.5 by default, so we do not need to change it.
When testing our model on the test set, however, we get an accuracy of 86.54% and an f1 score of 87.72%.
confusionMatrix(preds_knn, test_set_knn$Classes,
mode = "everything",
positive="fire")
Confusion Matrix and Statistics
Reference
Prediction not_fire fire
not_fire 20 4
fire 3 25
Accuracy : 0.8654
95% CI : (0.7421, 0.9441)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 2.105e-06
Kappa : 0.7284
Mcnemar's Test P-Value : 1
Sensitivity : 0.8621
Specificity : 0.8696
Pos Pred Value : 0.8929
Neg Pred Value : 0.8333
Precision : 0.8929
Recall : 0.8621
F1 : 0.8772
Prevalence : 0.5577
Detection Rate : 0.4808
Detection Prevalence : 0.5385
Balanced Accuracy : 0.8658
'Positive' Class : fire
After plotting the ROC curve, we get an AUC of 86.58% which is the worst out of all of our models so far.
auc
[1] 0.8658171
The goal of ensemble modeling is to improve performance over a baseline model by combining multiple models. So, we will set the baseline performance measure by starting with one algorithm. In our case, we will build a simple decision tree.
Decision trees are widely used classifiers in industry, owing to the transparency of the rules that lead to a prediction. They are arranged in a hierarchical tree-like structure and are simple to understand and interpret. They are not susceptible to outliers and are able to capture nonlinear relationships.
We will be using the rpart library for creating decision trees. rpart stands for recursive partitioning and employs the CART (classification and regression trees) algorithm. Apart from rpart, there are many other decision tree libraries, such as C50, party, tree, and maptree.
library(rpart.plot)
Next, we create a decision tree model by calling the rpart function. Let’s first create a base model with default parameters and value. Notice that we do not include any train control meaning that we are not using any bagging, cross validation or pruning techniques. The resulting tree is a simple decision tree. We will explore the performance of the model on the train and test sets next.
base_model <- rpart(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data = train_set, method = "class")
summary(base_model)
Call:
rpart(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC +
ISI + BUI + FWI + RH, data = train_set, method = "class")
n= 191
CP nsplit rel error xerror xstd
1 0.9638554 0 1.00000000 1.00000000 0.08253842
2 0.0100000 1 0.03614458 0.09638554 0.03335614
Variable importance
FFMC ISI FWI DMC BUI DC
21 20 18 14 14 13
Node number 1: 191 observations, complexity param=0.9638554
predicted class=1 expected loss=0.434555 P(node) =1
class counts: 83 108
probabilities: 0.435 0.565
left son=2 (80 obs) right son=3 (111 obs)
Primary splits:
FFMC < 0.1608133 to the left, improve=88.02604, (0 missing)
ISI < -0.5036757 to the left, improve=88.02604, (0 missing)
FWI < -0.4281113 to the left, improve=75.38051, (0 missing)
DMC < -0.5269618 to the left, improve=50.85939, (0 missing)
BUI < -0.6424139 to the left, improve=46.48449, (0 missing)
Surrogate splits:
ISI < -0.5518194 to the left, agree=0.990, adj=0.975, (0 split)
FWI < -0.5557897 to the left, agree=0.942, adj=0.862, (0 split)
DMC < -0.5390654 to the left, agree=0.864, adj=0.675, (0 split)
BUI < -0.6424139 to the left, agree=0.853, adj=0.650, (0 split)
DC < -0.6656973 to the left, agree=0.848, adj=0.638, (0 split)
Node number 2: 80 observations
predicted class=0 expected loss=0 P(node) =0.4188482
class counts: 80 0
probabilities: 1.000 0.000
Node number 3: 111 observations
predicted class=1 expected loss=0.02702703 P(node) =0.5811518
class counts: 3 108
probabilities: 0.027 0.973
#Plot Decision Tree
rpart.plot(base_model)
After exploring the confusion matrix and the different performance metrics, we can see that our base decision tree does not fit the training data perfectly: it makes 3 misclassifications on the training set. Those 3 false positives bring the model's accuracy to 98.43% and the f1-score to 98.63%.
confusionMatrix(preds_tree, train_set$Classes, mode = "everything", positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 80 0
1 3 108
Accuracy : 0.9843
95% CI : (0.9548, 0.9967)
No Information Rate : 0.5654
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9679
Mcnemar's Test P-Value : 0.2482
Sensitivity : 1.0000
Specificity : 0.9639
Pos Pred Value : 0.9730
Neg Pred Value : 1.0000
Precision : 0.9730
Recall : 1.0000
F1 : 0.9863
Prevalence : 0.5654
Detection Rate : 0.5654
Detection Prevalence : 0.5812
Balanced Accuracy : 0.9819
'Positive' Class : 1
Our base decision tree performs very well on unseen data, with just one false positive, an accuracy of 98.08%, and an f1-score of 98.31%.
base_model
n= 191
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 191 83 1 (0.43455497 0.56544503)
2) FFMC< 0.1608133 80 0 0 (1.00000000 0.00000000) *
3) FFMC>=0.1608133 111 3 1 (0.02702703 0.97297297) *
This model gives us an AUC of 97.82%
Pre-pruning is also known as early stopping criteria. As the name suggests, the criteria are set as parameter values while building the rpart model. Below are some of the pre-pruning criteria that can be used. The tree stops growing when it meets any of these pre-pruning criteria, or it discovers the pure classes.
The complexity parameter (cp) in rpart is the minimum improvement in the model needed at each node. It is based on the cost complexity of the model and works as follows:
The cp value is a stopping parameter. It helps speed up the search for splits because it can identify splits that don't meet this criterion and prune them before going too far.
Other parameters include but are not limited to:
maxdepth: sets the maximum depth of the tree; this is part of the pre-pruning step.
minsplit: the minimum number of records that must exist in a node for a split to be attempted.
Finally, since we are in a classification setting, we have to specify 'class' as the method for building our tree, instead of the 'anova' method used in regression settings.
pruned_base_model <- rpart(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data = train_set, method = "class", control = rpart.control(cp = 0, maxdepth = 8, minsplit = 8))
summary(pruned_base_model)
Call:
rpart(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC +
ISI + BUI + FWI + RH, data = train_set, method = "class",
control = rpart.control(cp = 0, maxdepth = 8, minsplit = 8))
n= 191
CP nsplit rel error xerror xstd
1 0.96385542 0 1.00000000 1.00000000 0.08253842
2 0.01204819 1 0.03614458 0.09638554 0.03335614
3 0.00000000 2 0.02409639 0.09638554 0.03335614
Variable importance
ISI FFMC FWI DMC BUI DC
21 21 18 14 13 13
Node number 1: 191 observations, complexity param=0.9638554
predicted class=1 expected loss=0.434555 P(node) =1
class counts: 83 108
probabilities: 0.435 0.565
left son=2 (80 obs) right son=3 (111 obs)
Primary splits:
FFMC < 0.1608133 to the left, improve=88.02604, (0 missing)
ISI < -0.5036757 to the left, improve=88.02604, (0 missing)
FWI < -0.4281113 to the left, improve=75.38051, (0 missing)
DMC < -0.5269618 to the left, improve=50.85939, (0 missing)
BUI < -0.6424139 to the left, improve=46.48449, (0 missing)
Surrogate splits:
ISI < -0.5518194 to the left, agree=0.990, adj=0.975, (0 split)
FWI < -0.5557897 to the left, agree=0.942, adj=0.862, (0 split)
DMC < -0.5390654 to the left, agree=0.864, adj=0.675, (0 split)
BUI < -0.6424139 to the left, agree=0.853, adj=0.650, (0 split)
DC < -0.6656973 to the left, agree=0.848, adj=0.638, (0 split)
Node number 2: 80 observations
predicted class=0 expected loss=0 P(node) =0.4188482
class counts: 80 0
probabilities: 1.000 0.000
Node number 3: 111 observations, complexity param=0.01204819
predicted class=1 expected loss=0.02702703 P(node) =0.5811518
class counts: 3 108
probabilities: 0.027 0.973
left son=6 (3 obs) right son=7 (108 obs)
Primary splits:
ISI < -0.4796039 to the left, improve=2.5230230, (0 missing)
FWI < -0.6297088 to the left, improve=2.5230230, (0 missing)
FFMC < 0.2967052 to the left, improve=2.0878380, (0 missing)
DMC < -0.3978571 to the left, improve=0.9628378, (0 missing)
DC < -0.6331791 to the left, improve=0.8572553, (0 missing)
Surrogate splits:
FWI < -0.6297088 to the left, agree=0.982, adj=0.333, (0 split)
Node number 6: 3 observations
predicted class=0 expected loss=0.3333333 P(node) =0.01570681
class counts: 2 1
probabilities: 0.667 0.333
Node number 7: 108 observations
predicted class=1 expected loss=0.009259259 P(node) =0.565445
class counts: 1 107
probabilities: 0.009 0.991
#Plot Decision Tree
printcp(pruned_base_model)
Classification tree:
rpart(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC +
ISI + BUI + FWI + RH, data = train_set, method = "class",
control = rpart.control(cp = 0, maxdepth = 8, minsplit = 8))
Variables actually used in tree construction:
[1] FFMC ISI
Root node error: 83/191 = 0.43455
n= 191
CP nsplit rel error xerror xstd
1 0.963855 0 1.000000 1.000000 0.082538
2 0.012048 1 0.036145 0.096386 0.033356
3 0.000000 2 0.024096 0.096386 0.033356
rpart.plot(pruned_base_model)
The summary of our base model will give us the details of each split with the number of observations, the value of the complexity parameter, the predicted class, the class counts with their probabilities and the children of the node. It will also give details about the future splits starting with the primary splits that will follow and the percent improvement in the prediction as well as the surrogate splits that come later on.
The resulting tree, as explained in the above section, is the smallest tree with the lowest misclassification loss. This tree is plotted with the split details and leaf node classes.
The optimal CP value found was 0.012048
confusionMatrix(preds, train_set$Classes,
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 82 1
1 1 107
Accuracy : 0.9895
95% CI : (0.9627, 0.9987)
No Information Rate : 0.5654
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9787
Mcnemar's Test P-Value : 1
Sensitivity : 0.9907
Specificity : 0.9880
Pos Pred Value : 0.9907
Neg Pred Value : 0.9880
Precision : 0.9907
Recall : 0.9907
F1 : 0.9907
Prevalence : 0.5654
Detection Rate : 0.5602
Detection Prevalence : 0.5654
Balanced Accuracy : 0.9893
'Positive' Class : 1
The training accuracy of our tree is 98.95%, with an f1 score of 99.07% and a total of 2 misclassifications: 1 FP and 1 FN.
We have a 96.15% accuracy on our held-out validation set which means we have successfully avoided over-fitting using tree pruning.
confusionMatrix(preds, test_set$Classes,
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 23 2
1 0 27
Accuracy : 0.9615
95% CI : (0.8679, 0.9953)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 5.691e-11
Kappa : 0.9227
Mcnemar's Test P-Value : 0.4795
Sensitivity : 0.9310
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9200
Precision : 1.0000
Recall : 0.9310
F1 : 0.9643
Prevalence : 0.5577
Detection Rate : 0.5192
Detection Prevalence : 0.5192
Balanced Accuracy : 0.9655
'Positive' Class : 1
The AUC was 96.55%
Bagging, or bootstrap aggregation, is an ensemble method that involves training the same algorithm many times by using different subsets sampled from the training data. The final output prediction is then averaged across the predictions of all the sub-models. The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.
This method performs best with algorithms that have high variance. The argument method = "treebag" specifies the algorithm. We will train our model using 5-fold cross-validation repeated 5 times. The sampling strategy used for the bagged trees is ROSE.
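A sketch of the bagged-trees training call; the exact control settings are an assumption:

```r
set.seed(1000)
# 5-fold CV repeated 5 times, with ROSE resampling
# (requires the ROSE package)
bag_control <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                            sampling = "rose")
bag_model <- train(as.factor(Classes) ~ ., data = train_set,
                   method = "treebag", trControl = bag_control)
```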
We achieved a perfect fit using bagged trees trained using a 5-fold CV repeated 5 times.
confusionMatrix(preds, train_set$Classes,
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 83 0
1 0 108
Accuracy : 1
95% CI : (0.9809, 1)
No Information Rate : 0.5654
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Precision : 1.0000
Recall : 1.0000
F1 : 1.0000
Prevalence : 0.5654
Detection Rate : 0.5654
Detection Prevalence : 0.5654
Balanced Accuracy : 1.0000
'Positive' Class : 1
The bagged model did not achieve perfect performance on unseen data, which leads us to believe it overfit the training data. A likely cause is that the bagged trees were highly correlated with each other, due to the absence of randomization in the features used for each tree: most probably, the same strong predictors were used in every bagged tree. To address this, we will implement Random Forests next, which add this randomization to the features selected for each tree. The accuracy we obtained is 98.08%.
confusionMatrix(preds, as.factor(test_set$Classes),
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 23 1
2 0 28
Accuracy : 0.9808
95% CI : (0.8974, 0.9995)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 2.743e-12
Kappa : 0.9612
Mcnemar's Test P-Value : 1
Sensitivity : 1.0000
Specificity : 0.9655
Pos Pred Value : 0.9583
Neg Pred Value : 1.0000
Precision : 0.9583
Recall : 1.0000
F1 : 0.9787
Prevalence : 0.4423
Detection Rate : 0.4423
Detection Prevalence : 0.4615
Balanced Accuracy : 0.9828
'Positive' Class : 1
Random Forest is an extension of bagged decision trees, where in addition to sampling the data, we also sample the variables in each bagged decision tree. The trees are constructed with the objective of reducing the correlation between the individual decision trees by making sure we do not use the same strong predictors in all bagged trees resulting in strongly correlated trees.
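With caret, the feature subsampling described above is controlled by the `mtry` tuning parameter. A sketch of the training step, assuming the same `train_set` as before (caret and randomForest packages required):

```r
# Hedged sketch: train_set is assumed from the earlier preprocessing;
# caret and randomForest must be installed.
library(caret)

set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# mtry is the number of predictors sampled at each split -- this is the
# extra randomization that decorrelates the individual trees.
rf_model <- train(Classes ~ ., data = train_set,
                  method = "rf",
                  tuneGrid = expand.grid(mtry = c(2, 3, 5)),
                  trControl = ctrl)
```

Smaller `mtry` values decorrelate the trees more aggressively, at the cost of weaker individual trees.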
Once again, our random forest model achieved a perfect fit on the training set with a 5-fold CV repeated 5 times.
confusionMatrix(preds, train_set$Classes,
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 83 0
1 0 108
Accuracy : 1
95% CI : (0.9809, 1)
No Information Rate : 0.5654
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Precision : 1.0000
Recall : 1.0000
F1 : 1.0000
Prevalence : 0.5654
Detection Rate : 0.5654
Detection Prevalence : 0.5654
Balanced Accuracy : 1.0000
'Positive' Class : 1
On our dataset, the random forest did not perform better than the previous models, yielding an accuracy of 96.15% on the test set. So far, bagging has been the best method.
preds = predict(rf_model,test_set, type="raw")
preds <- mapvalues(preds, from=c(0,1), to=c(1,2))
confusionMatrix(preds, as.factor(test_set$Classes),
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 22 1
2 1 28
Accuracy : 0.9615
95% CI : (0.8679, 0.9953)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 5.691e-11
Kappa : 0.922
Mcnemar's Test P-Value : 1
Sensitivity : 0.9565
Specificity : 0.9655
Pos Pred Value : 0.9565
Neg Pred Value : 0.9655
Precision : 0.9565
Recall : 0.9565
F1 : 0.9565
Prevalence : 0.4423
Detection Rate : 0.4231
Detection Prevalence : 0.4423
Balanced Accuracy : 0.9610
'Positive' Class : 1
In boosting, multiple models are trained sequentially and each model learns from the errors of its predecessors. We will use the Stochastic Gradient Boosting algorithm.
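The "learns from the errors of its predecessors" idea can be illustrated with a toy base-R example (not the project's model: a hand-rolled regression booster that repeatedly fits a single-split "stump" to the current residuals under squared-error loss):

```r
# Toy gradient-boosting loop in base R: each round fits a stump to the
# residuals of the ensemble so far and adds a shrunken correction.
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)

pred <- rep(0, length(y))  # current ensemble prediction
eta  <- 0.5                # learning rate (shrinkage)

for (m in 1:100) {
  resid <- y - pred                      # errors of the ensemble so far
  # choose the single split on x with the lowest squared error
  sse <- sapply(2:(length(x) - 1), function(s) {
    sum((resid[1:s] - mean(resid[1:s]))^2) +
      sum((resid[(s + 1):length(x)] - mean(resid[(s + 1):length(x)]))^2)
  })
  s <- which.min(sse) + 1
  fit <- ifelse(seq_along(x) <= s,
                mean(resid[1:s]), mean(resid[(s + 1):length(x)]))
  pred <- pred + eta * fit               # add the shrunken correction
}

mean((y - pred)^2)  # training MSE shrinks as rounds accumulate
```

The many sequential rounds in this loop are exactly why boosting is slower to train than bagging, whose trees can be fit independently.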
An important thing to note is that stochastic gradient boosting takes much longer to train, as it is a stage-wise method that needs many iterations to converge; adding cross-validation makes it longer still.
control <- trainControl(method="cv", number=5)
SGB <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, method="gbm", metric="Accuracy", trControl=control)
The boosted model also reached 100% accuracy on the training data.
confusionMatrix(preds, train_set$Classes,
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 83 0
1 0 108
Accuracy : 1
95% CI : (0.9809, 1)
No Information Rate : 0.5654
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Precision : 1.0000
Recall : 1.0000
F1 : 1.0000
Prevalence : 0.5654
Detection Rate : 0.5654
Detection Prevalence : 0.5654
Balanced Accuracy : 1.0000
'Positive' Class : 1
We obtained 98.08% accuracy on unseen data, matching the bagged trees model. Boosting improved on our random forest model (96.15%), though at a greater training cost.
confusionMatrix(preds, as.factor(test_set$Classes),
mode = "everything",
positive='1')
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 23 1
2 0 28
Accuracy : 0.9808
95% CI : (0.8974, 0.9995)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 2.743e-12
Kappa : 0.9612
Mcnemar's Test P-Value : 1
Sensitivity : 1.0000
Specificity : 0.9655
Pos Pred Value : 0.9583
Neg Pred Value : 1.0000
Precision : 0.9583
Recall : 1.0000
F1 : 0.9787
Prevalence : 0.4423
Detection Rate : 0.4423
Detection Prevalence : 0.4615
Balanced Accuracy : 0.9828
'Positive' Class : 1
Support Vector Machine (SVM) is a discriminative classifier that separates observations using the hyperplane that best differentiates between the classes. Its advantages lie in its flexibility and its good performance on high-dimensional data.
We will use SVM on our dataset to demonstrate its capabilities.
The goal of the SVM is to identify the boundary that maximizes the margin, i.e., the distance between the hyperplane and the closest points of each class.
There are two hyper-parameters to consider before training our SVM model. The cost C acts as a regularization parameter, trading off correct classification of the training examples against a wide margin: the larger the value of C, the more heavily training misclassifications are penalized, so fewer errors are tolerated at the cost of a narrower margin and a higher risk of overfitting. The second hyper-parameter, gamma, controls how much curvature we allow in the decision boundary.
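The trade-off controlled by C is the standard soft-margin objective (textbook formulation, stated here for reference, not taken from the project code):

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

A large C makes the slack variables $\xi_i$ expensive, so the optimizer tolerates few training errors; a small C allows more slack in exchange for a wider margin $2/\lVert w\rVert$.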
We start by tuning our model according to different values of gamma and C. We will start by using a linear kernel.
To use the cross-validation functions from the caret package, we need to turn the 0/1 values of the variable Classes into the factor levels "fire"/"not_fire" (caret requires valid R variable names as class labels when computing class probabilities). These functions will allow us to find the best values for both gamma and C; we used a tuning length of 10.
train_set_svm$Classes
[1] not_fire not_fire not_fire not_fire fire fire not_fire not_fire
[9] fire fire not_fire not_fire not_fire not_fire not_fire not_fire
[17] fire not_fire not_fire fire fire fire fire not_fire
[25] fire fire fire fire fire fire fire fire
[33] not_fire fire fire fire fire fire fire fire
[41] fire fire not_fire not_fire fire fire fire not_fire
[49] fire fire fire fire not_fire not_fire fire fire
[57] fire fire fire fire fire fire fire fire
[65] fire fire fire fire not_fire fire fire fire
[73] not_fire not_fire fire not_fire not_fire fire not_fire not_fire
[81] fire fire fire fire fire fire fire not_fire
[89] fire not_fire not_fire fire not_fire not_fire not_fire not_fire
[97] not_fire not_fire not_fire fire fire not_fire not_fire fire
[105] fire not_fire not_fire not_fire not_fire not_fire not_fire not_fire
[113] fire fire fire fire fire fire not_fire fire
[121] not_fire not_fire fire not_fire fire not_fire not_fire not_fire
[129] not_fire not_fire not_fire not_fire fire fire fire not_fire
[137] not_fire not_fire fire fire fire fire fire not_fire
[145] not_fire fire fire fire fire not_fire fire fire
[153] fire fire fire fire fire fire fire fire
[161] fire fire fire fire fire fire not_fire not_fire
[169] not_fire not_fire not_fire fire not_fire not_fire not_fire not_fire
[177] not_fire not_fire not_fire not_fire fire fire fire fire
[185] fire not_fire not_fire not_fire not_fire not_fire not_fire
Levels: not_fire fire
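The recoding and tuning steps might look like the following sketch (names such as `train_set` and `train_set_svm` are assumed from earlier; caret and kernlab packages required):

```r
# Hedged sketch of the recoding and tuning; caret and kernlab required.
library(caret)

# recode the 0/1 target into valid R factor level names
train_set_svm <- train_set
train_set_svm$Classes <- factor(ifelse(train_set$Classes == 1,
                                       "fire", "not_fire"),
                                levels = c("not_fire", "fire"))

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE,               # needed for ROC metric
                     summaryFunction = twoClassSummary)

set.seed(123)
svm_model_radial <- train(Classes ~ ., data = train_set_svm,
                          method = "svmRadial",       # RBF kernel
                          metric = "ROC",
                          tuneLength = 10,            # 10 candidate C values
                          trControl = ctrl)
```

With `method = "svmRadial"` and `tuneLength = 10`, caret holds sigma at an estimated value and searches ten values of C, which matches the tuning output shown below.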
We perform 10-fold cross-validation repeated 5 times.
svm_model_radial
Support Vector Machines with Radial Basis Function Kernel
191 samples
9 predictor
2 classes: 'not_fire', 'fire'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 172, 172, 172, 172, 172, 173, ...
Resampling results across tuning parameters:
C ROC Sens Spec
0.25 0.9879874 0.9294444 0.9383636
0.50 0.9920732 0.9294444 0.9458182
1.00 0.9937298 0.9441667 0.9441818
2.00 0.9930960 0.9516667 0.9552727
4.00 0.9933434 0.9591667 0.9534545
8.00 0.9927121 0.9688889 0.9612727
16.00 0.9935934 0.9663889 0.9687273
32.00 0.9944798 0.9519444 0.9687273
64.00 0.9950631 0.9475000 0.9778182
128.00 0.9947879 0.9280556 0.9760000
Tuning parameter 'sigma' was held constant at a value of 0.1459001
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.1459001 and C = 64.
Cross-validated tuning gave the radial kernel an AUC of roughly 99.5% (best ROC = 0.9951 at C = 64), while the linear kernel gave us an AUC of 99.8%.
We then show the confusion matrix of the radial-kernel model on the test set.
confusionMatrix(preds_svm_test_radial,test_set_svm$Classes, positive="fire")
Confusion Matrix and Statistics
Reference
Prediction not_fire fire
not_fire 21 1
fire 2 28
Accuracy : 0.9423
95% CI : (0.8405, 0.9879)
No Information Rate : 0.5577
P-Value [Acc > NIR] : 7.729e-10
Kappa : 0.8825
Mcnemar's Test P-Value : 1
Sensitivity : 0.9655
Specificity : 0.9130
Pos Pred Value : 0.9333
Neg Pred Value : 0.9545
Prevalence : 0.5577
Detection Rate : 0.5385
Detection Prevalence : 0.5769
Balanced Accuracy : 0.9393
'Positive' Class : fire
For the radial kernel, our model gave an accuracy of 100% on the training set and 94.23% on the test set.
For the linear kernel, the model gave an accuracy of 98.95% on the training set, fitting the data less closely than the radial kernel, which is expected since it is less flexible. On the test set, however, it gave us an accuracy of 96.15%.
Overall, the linear kernel did a better job of generalizing to unseen data, which is why we select it for our final model.
The model with the linear kernel gave us a test-set AUC of 96.10%.
auc
[1] 0.9610195
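The AUC above can be computed with the pROC package; a sketch, where `svm_model_linear` and `test_set_svm` are the assumed names of the fitted linear-kernel model and the recoded test split:

```r
# Hedged sketch: svm_model_linear and test_set_svm are assumed names;
# the pROC package must be installed.
library(pROC)

# class probabilities for the positive class "fire"
probs <- predict(svm_model_linear, test_set_svm, type = "prob")[, "fire"]

roc_obj <- roc(response = test_set_svm$Classes, predictor = probs)
auc(roc_obj)
```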
For comparison, the radial-kernel SVM gave an AUC value of 0.953 on the test set.